July 9, 2019

Open Analytics

Data Science Company

Docker

Welcome

Docker for Data Science: R, ShinyProxy and more

Docker the Word

  • strong and clever man
  • pianos and trucks
  • advent of the container

Docker the Technology

  • put software components together and ship it around
  • program vs. ‘process’: a process is a program being executed
  • image ‘blueprint’ vs. container: instance of an image that is running
  • OOP: class vs. object

Docker Images

$ sudo docker images
REPOSITORY                 TAG     IMAGE ID       CREATED      SIZE
gbmperf_cpu                latest  50e874eb88e1   3 days ago   3.41GB
postgres                   alpine  5e83e6aa7014   12 days ago  70.8MB
openanalytics/rdepot-repo  latest  bc7b067e0170   12 days ago  104MB
openanalytics/rdepot-app   latest  af2ba41e5049   12 days ago  2.82GB
openanalytics/r-base       latest  9e7a835c395e   6 weeks ago  585MB

Docker Containers

$ sudo docker run -it openanalytics/r-base R
$ sudo docker ps 
CONTAINER ID  IMAGE                COMMAND  CREATED         STATUS        PORTS NAMES
3a3c4f4f9d5e  openanalytics/r-base "R"      14 seconds ago  Up 13 seconds       flamboyant_shaw

Isolation of Containers

Kernel namespaces:

  • PID: isolate allocation of process identifiers

  • network: isolated network interface controllers, firewall rules, routing tables

  • mount: file system layout, read-only mount points etc.

  • user: isolation of user ids

  • view from inside, view from outside
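
One way to make the namespace idea concrete on a Linux host is to look at the namespace identifiers the kernel attaches to every process. The sketch below is an illustration only and was not part of the demo; the helper name ns_id is made up here, and it assumes a Linux /proc file system.

```shell
# ns_id TYPE [PID]: print the identifier of one namespace of a process.
# Defaults to the calling process; assumes Linux with /proc mounted.
ns_id() {
  readlink "/proc/${2:-self}/ns/$1"
}

# Namespaces of the current shell, one per line (e.g. pid:[...]).
for ns in pid net mnt user; do
  ns_id "$ns" "$$" || echo "$ns: not available"
done

# With the R container from above still running (requires Docker), the same
# lookup from inside the container should report a different PID namespace:
#   sudo docker exec 3a3c4f4f9d5e readlink /proc/self/ns/pid
```

Two processes that print the same identifier share that namespace; a container's processes normally get fresh PID, network and mount namespaces.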

Process Isolation

From inside

$ sudo docker ps
CONTAINER ID        IMAGE                  COMMAND             CREATED             STATUS              PORTS               NAMES
3a3c4f4f9d5e        openanalytics/r-base   "R"                 About an hour ago   Up About an hour                        flamboyant_shaw

Get into container and run bash

sudo docker exec -it 3a3c4f4f9d5e bash
top

Process Isolation

From outside:

$ sudo docker container top 3a3c4f4f9d5e
UID   PID   PPID  C    STIME  TTY    TIME      CMD
root  9872  9848  0    09:42  pts/0  00:00:00  /usr/lib/R/bin/exec/R
root  11257 9848  0    10:51  pts/1  00:00:00  bash

No Virtualization

  • isolation purely managed by the operating system
  • operating system is shared by the containers (no guest operating system)
  • no hardware virtualization (as in KVM or VMWare)
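
A quick way to check the shared operating system claim on a Linux host is to compare the kernel release seen on the host with the one seen inside a container. This is a minimal sketch: the function names are made up for the illustration, and with Docker Desktop on macOS or Windows the containers run inside a helper VM, so the comparison only makes sense on Linux.

```shell
# Kernel release as seen on the host.
host_kernel() {
  uname -r
}

# Kernel release as seen from inside a throwaway container (requires Docker).
container_kernel() {
  sudo docker run --rm openanalytics/r-base uname -r
}

# Usage on a Linux machine with Docker installed:
#   [ "$(host_kernel)" = "$(container_kernel)" ] && echo "same kernel"
```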

Building an Image

  • explicit recipe called the Dockerfile
  • start FROM an existing image e.g. official image of a certain Linux distribution
  • see the openanalytics/r-base Dockerfile

Let’s Do It!

$ sudo docker build -t openanalytics/r-base .
Sending build context to Docker daemon  152.6kB
Step 1/13 : FROM ubuntu:18.04
 ---> 1d9c17228a9e
Step 2/13 : LABEL maintainer="Tobias Verbeke <tobias.verbeke@openanalytics.eu>"
 ---> Using cache
 ---> 8332dc56486d
Step 3/13 : RUN useradd docker  && mkdir /home/docker   && chown docker:docker /home/docker     && addgroup docker staff
 ---> Using cache
 ---> d2fb24b21f1a
 [...]

Dockerfile DSL Overview

  • ‘build context’: the directory passed to docker build (here ., the current working directory)
  • use .dockerignore to exclude files from the build context
  • FROM: base image to start from
  • LABEL: add metadata to the image (e.g. maintainer)
  • RUN: command to execute at build time (as root by default)
  • ENV: environment variable to set
  • COPY: add files from the build context to the image
  • CMD: default command to run when a container starts
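
To see these instructions side by side, the sketch below writes a tiny build context to a scratch directory. Everything in it apart from the openanalytics/r-base base image is hypothetical (the hello.R script, the APP_NAME variable, the /opt/demo path, the maintainer address and the dsl-demo tag); it is not the Dockerfile of any image shown in this talk.

```shell
# Scratch directory that will serve as the build context.
ctx=$(mktemp -d)

# A one-line R script to COPY into the image (hypothetical content).
cat > "$ctx/hello.R" <<'EOF'
cat("Hello from", Sys.getenv("APP_NAME"), "\n")
EOF

# One instruction of each kind listed above.
cat > "$ctx/Dockerfile" <<'EOF'
FROM openanalytics/r-base
LABEL maintainer="Jane Doe <jane@example.org>"
ENV APP_NAME=dsl-demo
RUN mkdir -p /opt/demo
COPY hello.R /opt/demo/hello.R
CMD ["Rscript", "/opt/demo/hello.R"]
EOF

# Keep version control data and tarballs out of the build context.
printf '%s\n' '.git' '*.tar.gz' > "$ctx/.dockerignore"

# To build and run (requires Docker):
#   sudo docker build -t dsl-demo "$ctx"
#   sudo docker run --rm dsl-demo
```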

Layered File Systems

Inspecting Layers

$ sudo docker history openanalytics/r-base
IMAGE          CREATED      CREATED BY                                      SIZE  COMMENT
9e7a835c395e   6 weeks ago  /bin/sh -c #(nop) CMD ["R"]                     0B                  
53556f13a4fb   6 weeks ago  /bin/sh -c apt-get update  && apt-get instal…   454MB               
7751c94215fd   6 weeks ago  /bin/sh -c #(nop)  ENV R_BASE_VERSION=3.5.3     0B                  
273e01930784   6 weeks ago  /bin/sh -c apt-key adv --keyserver keyserver…   2.38kB              
aed5a9f91923   6 weeks ago  /bin/sh -c echo "deb https://cloud.r-project…   64B                 
0ac26c2bbe68   6 weeks ago  /bin/sh -c #(nop)  ENV LANG=en_US.UTF-8         0B                  
7f7de126a4b8   6 weeks ago  /bin/sh -c #(nop)  ENV LC_ALL=en_US.UTF-8       0B                  
30da9ee49150   6 weeks ago  /bin/sh -c echo "en_US.UTF-8 UTF-8" >> /etc/…   1.69MB              
b4b8698f2896   6 weeks ago  /bin/sh -c apt-get update  && apt-get instal…   42.6MB              
ff46a7c61686   6 weeks ago  /bin/sh -c #(nop)  ENV DEBIAN_FRONTEND=nonin…   0B                  
d2fb24b21f1a   6 weeks ago  /bin/sh -c useradd docker  && mkdir /home/do…   393kB               
8332dc56486d   6 weeks ago  /bin/sh -c #(nop) LABEL maintainer=Tobias V…    0B                  
1d9c17228a9e   6 months ago /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B                  
<missing>      6 months ago /bin/sh -c mkdir -p /run/systemd && echo 'do…   7B                  
<missing>      6 months ago /bin/sh -c rm -rf /var/lib/apt/lists/*          0B                  
<missing>      6 months ago /bin/sh -c set -xe   && echo '#!/bin/sh' > /…   745B                
<missing>      6 months ago /bin/sh -c #(nop) ADD file:c0f17c7189fc11b6a…   86.7MB 

Revisiting the Image

  • tar file
  • image consists of filesystem layers and metadata (stored as JSON files)
  • layer is a collection of changes to files, can be added one after the other
  • Docker image format is laid down in a specification

Union FileSystem

  • file system where you take the union of the layers
  • think ‘pen on paper model’ of base R graphics
  • draw a black line, draw a red line: you only see the red line
  • draw a new dot, you see the new dot
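
The pen-on-paper picture can be imitated with plain directories. This is only a toy model under simplifying assumptions: it copies files into a merged directory, whereas a real union file system presents the merged view without copying, and deleting files in a later layer is not modelled. The file names are made up.

```shell
work=$(mktemp -d)
mkdir -p "$work/layer1" "$work/layer2" "$work/merged"

# Layer 1 draws the "black line" plus one other file.
echo black > "$work/layer1/line.txt"
echo base  > "$work/layer1/os-release.txt"

# Layer 2 redraws the line in red and adds a new "dot".
echo red > "$work/layer2/line.txt"
echo dot > "$work/layer2/dot.txt"

# Apply the layers bottom-up; a later layer wins when both contain a file.
for layer in layer1 layer2; do
  cp -R "$work/$layer/." "$work/merged/"
done

cat "$work/merged/line.txt"   # prints: red
ls "$work/merged"             # dot.txt, line.txt and os-release.txt
```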

Sharing a Benchmark

git clone https://github.com/szilard/GBM-perf.git
cd GBM-perf/cpu
sudo docker build --build-arg CACHE_DATE=$(date +%Y-%m-%d) -t gbmperf_cpu .
sudo docker run --rm gbmperf_cpu

Reproducibility: Java needed, R packages from GitHub

Stop the Benchmark

Try Ctrl-C (SIGINT) first…

$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS               NAMES
23a015e33372        gbmperf_cpu         "/bin/sh -c 'cd GBM-…"   About a minute ago   Up About a minute   8787/tcp            quirky_sutherland

$ sudo docker stop 23a015e33372
23a015e33372

$ sudo docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

  • sudo docker stop (SIGTERM first, SIGKILL after a grace period) is friendlier than sudo docker kill (SIGKILL right away)
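
The difference matters because a process can catch SIGTERM and clean up, while SIGKILL cannot be caught. The sketch below shows this with an ordinary background shell process standing in for a container; no Docker is involved, and the log file is just a scratch file for the illustration.

```shell
log=$(mktemp)

# Stand-in for a containerized process that cleans up when asked to stop.
(
  trap 'echo "cleaning up" >> "$log"; exit 0' TERM
  while :; do sleep 1; done
) &
pid=$!
sleep 1               # give the handler time to be installed

kill -TERM "$pid"     # the signal docker stop sends first
wait "$pid"
cat "$log"            # prints: cleaning up
```

With kill -KILL in place of kill -TERM the handler would never run and the log would stay empty.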

Complex Environment for Statistical Computing

Container Registries

  • where to find and where to publish images? (storage and distribution)
  • repository of images, with an API on top
  • the best-known (hosted) container registry is Docker Hub

docker pull

$ sudo docker pull hello-world

[...]

$ sudo docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

[...]

docker push

Sample Dockerfile

FROM openanalytics/r-base
CMD ["R", "-q", "-e", "cat('Hello useR!2019')"]

build it

sudo docker build -t openanalytics/hello-user2019 .

push to the registry

$ sudo docker push openanalytics/hello-user2019
The push refers to repository [docker.io/openanalytics/hello-user2019]
d5742bf4b34d: Preparing 
edf956298918: Preparing 
[...]
c8dbbe73b68c: Waiting 
2fb7bfc6145d: Waiting 
denied: requested access to the resource is denied

docker login

$ sudo docker login
Login with your Docker ID to push and pull images from Docker Hub. If you don't have a Docker ID, head over to https://hub.docker.com to create one.
Username: openanalytics 
Password: 
WARNING! Your password will be stored unencrypted in /home/tverbeke/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

docker push

$ sudo docker push openanalytics/hello-user2019
The push refers to repository [docker.io/openanalytics/hello-user2019]
d5742bf4b34d: Pushed
edf956298918: Pushed
298019512dd9: Pushed
d7fd52e782b4: Pushed
55792dcdca25: Pushed
4835240d7abd: Pushed
2c77720cf318: Mounted from openanalytics/r-shiny
1f6b6c7dc482: Mounted from openanalytics/r-shiny
c8dbbe73b68c: Mounted from openanalytics/r-shiny
2fb7bfc6145d: Mounted from openanalytics/r-shiny
latest: digest: sha256:c24e2dfd4ca2cd29a7cf0da802a97f1336c1079ab94208ad94edcbe54bd5a349 size: 2408

sudo docker pull openanalytics/hello-user2019
sudo docker run openanalytics/hello-user2019

Other Container Registries

Tags

Images have names, identifiers and tags

$ sudo docker images | head -n 2
REPOSITORY                     TAG     IMAGE ID       CREATED             SIZE
openanalytics/hello-user2019   latest  682c0cc54795   11 hours ago        585MB

Look back at command:

sudo docker build -t openanalytics/hello-user2019 .

First finding: Docker adds the latest tag by default when no tag is given.

$ sudo docker build --help | grep 'tag '
-t, --tag list     Name and optionally a tag in the 'name:tag' format
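
The 'name:tag' convention and the latest default can be spelled out in a few lines of shell. This is a sketch for illustration, not how Docker itself parses references: the function name split_image_ref and the registry host in the last call are made up, and references by digest (name@sha256:...) are not handled.

```shell
# Split an image reference into name and tag, defaulting the tag to "latest".
split_image_ref() {
  ref=$1
  last=${ref##*/}     # part after the last slash, so a registry port is ignored
  case $last in
    *:*) name=${ref%:*}; tag=${ref##*:} ;;
    *)   name=$ref;      tag=latest ;;
  esac
  printf '%s %s\n' "$name" "$tag"
}

split_image_ref openanalytics/hello-user2019         # openanalytics/hello-user2019 latest
split_image_ref openanalytics/hello-user2019:0.0.1   # openanalytics/hello-user2019 0.0.1
split_image_ref registry.example.org:5000/demo       # registry.example.org:5000/demo latest
```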

docker tag

$ sudo docker tag openanalytics/hello-user2019 openanalytics/hello-user2019:0.0.1


$ sudo docker images | grep hello
openanalytics/hello-user2019  0.0.1   682c0cc54795   20 hours ago  585MB
openanalytics/hello-user2019  latest  682c0cc54795   20 hours ago  585MB
hello-world                   latest  fce289e99eb9   6 months ago 1.84kB


$ sudo docker run openanalytics/hello-user2019:0.0.1 

Rocker project

  • beautiful project started by ‘R in Docker’ pioneers Dirk Eddelbuettel and Carl Boettiger
  • versioned stack
    • r-ver: specific versions of R
    • rstudio: adds rstudio
    • tidyverse: adds tidyverse & devtools
    • verse: adds tex & publishing-related packages
    • geospatial: adds geospatial libraries
  • base stack:
    • r-base: latest R release
    • r-devel: development version of R, installed as RD alongside the release version R
    • drd: lightweight version of r-devel, built less regularly
  • additional images: see https://www.rocker-project.org/

tags in rocker project

$ sudo docker pull rocker/r-ver:devel

$ cd ~/git/rdepot-demo/examples

$ sudo docker run -it -v /home/tverbeke/git/rdepot-demo/examples/oaColors_0.0.4.tar.gz:/root/oaColors_0.0.4.tar.gz rocker/r-ver:devel bash

cd /root/
R CMD check --as-cran properties_0.0-9.tar.gz 
* using log directory ‘/root/properties.Rcheck’
* using R Under development (unstable) (2019-07-05 r76788)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* using option ‘--as-cran’
* checking for file ‘properties/DESCRIPTION’ ... OK
[...]

Volume Mounting

sudo docker run -v /path/on/host:/path/in/container:options
  • useful option is e.g. ro for read-only
  • can be a folder or a single file
  • the host path will be created (as a folder) if it does not exist
  • can be used to offer persistence to Docker based apps (between ephemeral containers)
  • different types of mounts possible: local mounts, NFS, SSHFS etc.
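
As a small illustration of the host:container:options form, the sketch below prepares a scratch folder on the host and a read-only mount specification for it. The folder contents and the /data mount point are made up; the docker run lines are left as comments because they need a Docker installation.

```shell
# Host folder with one input file to share with the container.
host_dir=$(mktemp -d)
echo "hello from the host" > "$host_dir/input.txt"

# Mount specification: host path, path in the container, options (ro = read-only).
mount_spec="$host_dir:/data:ro"
echo "$mount_spec"

# Requires Docker. Reading works, writing to /data is refused:
#   sudo docker run --rm -v "$mount_spec" openanalytics/r-base cat /data/input.txt
#   sudo docker run --rm -v "$mount_spec" openanalytics/r-base touch /data/new.txt
```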

Tips for Windows

c:/PATH
//c/PATH
/c//PATH 

In case of a SPACE in the path e.g. “Program Files”, the whole path should be in quotes.

Example:

docker run -it -v "C:/Python app/python-app":/src python-app

Docker Compose

Client-Server Architecture

  • client(s) e.g. docker command
  • server: docker daemon dockerd
  • can be on same system or remote system
  • cf. docker build output
Sending build context to Docker daemon 2.048kB

Expose Docker Daemon via TCP

Docker Compose

  • run multi-container applications with connected containers
  • no complex shell scripts and Makefiles (docker run this and docker run that)
  • but YAML based configuration that can be launched in one go

Distributed Modeling Case

docker-compose up

$ sudo docker-compose up
Creating network "distributedutils_distributed-modeling" with the default driver
Creating artemis-server ... 
Creating artemis-server ... done
Creating center2-r-session ... 
Creating center1-r-session ... 
Creating center1-r-session
Creating center2-r-session ... done
Attaching to artemis-server, center1-r-session, center2-r-session
artemis-server    | =========================================================================
artemis-server    | 
artemis-server    |   JBoss Bootstrap Environment
center1-r-session | > source('/root/center_rsession.R', echo=TRUE)
[...]

Peek under the Hood

version: '3'

services:
  artemis-server:
    image: registry.openanalytics.eu/public/artemis-server:latest
    container_name: artemis-server
    ports:
      - "8080:8080"
    networks:
      - distributed-modeling
  
  center1:
    image: registry.openanalytics.eu/public/center-r-session:latest
    container_name: center1-r-session
    environment:
      CENTER_ID: center1
    depends_on:
      - artemis-server
    command: R -q -e "source('/root/center_rsession.R', echo=TRUE)"
    networks:
      - distributed-modeling
[...]

rtq Docker

  • rtq provides reliable task queues for distributed architectures
  • container(s) listen for computational tasks on the queues to which tasks are posted
  • uses Redis, an in-memory data structure store, to hold the message queues

docker-compose.yml

rtq-docker Github repository

version: '3'
services:
  redis:
    image: redis
  rtq-worker:
    build: rtq-client
    environment:
        - REDIS_HOST=redis
        - REDIS_PORT=6379
    depends_on:
        - redis
  rtq-producer:
    build: rtq-client
    environment:
        - REDIS_HOST=redis
        - REDIS_PORT=6379
    depends_on:
        - redis

rtq in Docker Demo

sudo docker-compose build

start the redis server and the worker

sudo docker-compose up redis rtq-worker

submit a task to the queue

sudo docker-compose run rtq-producer \
  R -q -e 'rtq::createTask(rtq::RedisTQ(redux::redis_config(), "demo"), list(message = "hello!"))'

Scheduled Reporting

The Docker Compose file lives in a dedicated GitHub repository.

version: '3'
services:
  nginx:
    image: nginx:latest
    hostname: nginx
    restart: unless-stopped
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./rundeck:/etc/nginx/sites-enabled/rundeck:ro
    ports:
      - 80:80
    depends_on:
      - rundeck
    networks:
      - rundeck
  rundeck:

RDepot

  • data scientists need repositories to store their artifacts
  • R packages can be used as R packages, but also to package RMarkdown reports or Shiny apps
  • very few solutions exist to manage these professionally and to set up “internal CRAN” repositories
  • one such solution is RDepot (rdepot.io), which is distributed as Docker images

RDepot Docker Compose

cd ~/git/rdepot-demo
sudo docker-compose up
  • navigate to http://localhost (admin / admin)
  • create repository
  • upload package
  • install package in R session (full round trip)

Shiny Apps (on ShinyProxy)

ShinyProxy

Real-World Deployment

Pure R

  • develop an application on my laptop and get exactly the same behaviour once it is deployed

Containerize

  • run the process in an isolated context
  • with its own file system and network
  • with constraints (e.g. cpu or memory constraints)
  • using facilities of the underlying operating system
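
With Docker, such constraints are passed as options to docker run. The sketch below wraps this in a small function; the function name and the limits of 2 CPUs and 4 GB are arbitrary choices for the illustration, while --cpus and --memory are standard docker run options.

```shell
# run_limited IMAGE [COMMAND...]: run a throwaway container that may use
# at most 2 CPUs and 4 GB of memory (requires Docker).
run_limited() {
  image=$1
  shift
  sudo docker run --rm --cpus 2 --memory 4g "$image" "$@"
}

# Usage:
#   run_limited openanalytics/r-base R -q -e "summary(rnorm(1e6))"
```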

Docker

  • Docker trick: lightweight virtualisation

R inside Docker

Demo

sudo docker run -it -p 3838:3838 \
  openanalytics/shinyproxy-demo R -e "shinyproxy::run_01_hello()"

API to manage containers

API to launch containers…

… and shut down containers

Web Application in Front

  • application logic inside R and the Docker image
  • use enterprise Java for server-side management
    • authentication / authorization
    • management of Docker containers
    • proxying of users to the relevant container
  • use robust reverse proxies (Nginx, Apache) where and when needed

Demo

Add Authentication

Demo

  • hans / password
  • peter / password

Configuration

From app to container

Dockerfile

How to build it?

$ sudo docker build -t openanalytics/shinyproxy-template .
Sending build context to Docker daemon  74.24kB
Step 1/11 : FROM openanalytics/r-base
 ---> 9da1c5afb0b1
Step 2/11 : MAINTAINER Tobias Verbeke "tobias.verbeke@openanalytics.eu"
 ---> Using cache
 ---> b37ca7a5c47a
Step 3/11 : RUN apt-get update && apt-get install -y     sudo     pandoc     pandoc-citeproc     libcurl4-gnutls-dev     libcairo2-dev     libxt-dev     libssl-dev     libssh2-1-dev     libssl1.0.0
 ---> Using cache
 ---> 5c0b1258b30e
Step 4/11 : RUN apt-get update && apt-get install -y     libmpfr-dev
 ---> Using cache
 ---> 90a66fd24433
Step 5/11 : RUN R -e "install.packages(c('shiny', 'rmarkdown'), repos='https://cloud.r-project.org/')"
 ---> Using cache
 ---> df73ceafaaf9
Step 6/11 : RUN R -e "install.packages('Rmpfr', repos='https://cloud.r-project.org/')"
 ---> Using cache
 ---> 48c016875303
Step 7/11 : RUN mkdir /root/euler
 ---> Using cache
 ---> 5e8c836a28e4
Step 8/11 : COPY euler /root/euler
 ---> Using cache
 ---> 38e8f9ce4a65
Step 9/11 : COPY Rprofile.site /usr/lib/R/etc/
 ---> Using cache
 ---> 2b9ad2a93091
Step 10/11 : EXPOSE 3838
 ---> Using cache
 ---> 34726eba58bb
Step 11/11 : CMD ["R", "-e", "shiny::runApp('/root/euler')"]
 ---> Using cache
 ---> fedd4a91b9ba
Successfully built fedd4a91b9ba
Successfully tagged openanalytics/shinyproxy-template:latest

Update configuration

proxy:
  
  ...
  
  specs:
  
  ...

  - id: euler
    display-name: Euler's Number
    description: Compute Euler's number in arbitrary precision
    container-cmd: ["R", "-e", "shiny::runApp('/root/euler')"]
    container-image: openanalytics/shinyproxy-template

Walk Through Configuration

Templating

Templating (contd.)

Templating (contd.)

Python apps

Python apps (contd.)

proxy:
  
  ...
  
  specs:
  - id: dash-demo
    display-name: Dash Demo Application
    port: 8050
    docker-cmd: ["python", "app.py"]
    docker-image: openanalytics/shinyproxy-dash-demo
    ...

Notebooks

  • notebooks are interactive web-based ways to share “computational narratives” (RMarkdown plus live recomputation of chunks, i.e. “cells”)
  • typically web interface connected to kernels or shells that perform computation and provide output to be displayed in the web interface
  • Apache Zeppelin is a popular notebook focused on interactive data analytics, with a little big-data flavour: it “brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark”

Zeppelin in ShinyProxy

Zeppelin notebooks demonstration inside ShinyProxy

shiny:
  proxy:  
  
  [...]
  
  specs:
  - id: zeppelin
    display-name: Apache Zeppelin
    description: Apache Zeppelin Official Docker
    container-image: apache/zeppelin:0.8.1
    container-volumes: [ "/tmp/zeppelin/#{proxy.userId}/notebook:/zeppelin/notebook", "/tmp/zeppelin/#{proxy.userId}/logs:/zeppelin/logs", "/tmp/zeppelin/conf:/zeppelin/conf" ]
    port: 8080

RStudio IDE in ShinyProxy

See this Github repository

shiny:
  proxy:  
  
  [...]
  
  specs:
  - id: rstudio
    container-image: openanalytics/shinyproxy-rstudio-ide-demo
    container-env:
      DISABLE_AUTH: true
      USER: "#{proxy.userId}"
    port: 8787
    container-volumes: [ "/tmp/#{proxy.userId}:/home/#{proxy.userId}" ]

ContainerProxy

RConsoleProxy

  • connect from IDE to R processes running in containers
  • Docker benefits:
    • specific R versions, specific package libraries, project-specific containers etc.
    • CPU and memory constraints (“give me a session with 16 cores and 64 GB of RAM from the cluster”)
  • currently for the Architect IDE only (snapshot version)

Learn more?

ShinyProxy Demo image

ShinyProxy Template

ShinyProxy Config Examples

Kubernetes and the Cloud

Quid k8s?

Orchestration platform for Docker, i.e. to manage services across hosts and at scale (cluster).

Kubernetes Concepts

  • master node: receives orders on what should be run on the cluster and orchestrates the cluster resources
  • kubelet: service managing pods on a particular node (which has Docker installed)
  • pod: grouping of related containers
  • etcd: key/value store used to store information about the cluster

Cloud

Become cloud-vendor independent:

  • EKS: Elastic Kubernetes Service (AWS)
  • AKS: Azure Kubernetes Service
  • GKE: Google Kubernetes Engine

and allow for (infinite) autoscaling!

Applications

  • Scheduled Reporting on k8s see here.
  • Serving R APIs on k8s see here.
  • ShinyProxy on Kubernetes see here.
  • the sky is the limit…

Conclusions

  • you are all strong and clever people, real dockers
  • Docker offers a lot of opportunities in data science
  • reproducible environments for many tasks (benchmarks, R sessions, Shiny apps, serving APIs etc.)
  • build, ship and run!

Thanks!